I am analyzing a trace in GTFS, a collection of CSV files. Each record of a csv file is consisting of one or more fields speparated by commas ,. It is error-prone to read csv files by simply using [line.split(',') for line in f.readlines()] for fields might contain commas. Therefore, I decide to switch to the Python module csv.

1. CSV format

CSV (Comma-separated values) files store tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas[1]. CSV format is the most common import and export format for spreadsheets and databases. Note that fields with embedded commas must be quoted (enclosed within double-quote characters) to avoid confusions. For instance,

stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,location_type,parent_station
3666595,,"227, AVENUE DES NATIONS","AVENUE DE LA PYRAMIDE - 93073",48.98292708588317,2.5197046726137797,0,

The Python module csv is designed to read and write csv files, primarily including:

  • csv.reader(...) and csv.writer(...), read and write sequences
  • csv.DictReader(...) and csv.DictWriter(...), read and write data in dictionary form

2. Read and write sequences

2.1 Read a csv file into a list of lists

csv.reader(csvfile, dialect='excel', **fmtparams) returns a reader object which will iterate over lines in the given `csvfile.

import csv

# Usage
csv.reader(csvfile, dialect='excel', **fmtparams) # Return a reader object which will iterate over lines in the given csvfile. 

with open(file_preprocess, 'r') as f:
    lists = [row for row in csv.reader(f, delimiter=',')] # read a csv file into a list of lists

2.2 Write a list of lists to a csv file

csv.writer(csvfile, dialect='excel', **fmtparams) returns a writer object responsible for converting the user’s data into delimited strings on the given file-like object. csvfile can be any object with a write() method.

# Usage:
csv.writer(csvfile, dialect='excel', **fmtparams) # return a writer object

def write_lists_to_csv(filename, lists, fieldnames=None):
    with open(out_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)

        # add header 
        if fieldnames:
            writer.writerow(fieldnames)

        writer.writerows(lists)

Using newline='' is to disable universal newlines translation on all platforms.

2.3 dialect and fmtparams

(1) dialect

The value of dialect can be one of:

  • excel, the usual properties of an Excel-generated CSV file
  • excel-tab, the usual properties of an Excel-generated TAB-delimited file
  • unix, the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as line terminator and quoting all fields

(2) fmtparams

The parameter fmtparams is described in Dialects and Formatting Parameters. To make it easier to specify the format of input and output records, specific formatting parameters are grouped together into dialects. Dialects support the following attributes:

delimiter=','           # A one-character string used to separate fields
doublequote=True        # When True, a field within `quotechar` should be quoted; When False, the escapechar is used as a prefix to the quotechar
lineterminator='\r\n'   # The string used to terminate lines produced by the writer. Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator.
quotechar='"'           # A one-character string used to quote fields containing special characters
escapechar=QUOTE_NONE   # Escaping is disable by default.
quoting='QUOTE_MINIMAL' # Controls when quotes should be generated by the writer and recognised by the reader. 
skipinitialspace=False  # When True, whitespace immediately following the delimiter is ignored.
strict=False            # When True, raise exception Error on bad CSV input.

2. Read and write data in dictionary form

class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) # Create an object that maps the information read into a dict 
class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds) # Create an object that maps dictionaries onto output rows.

Appendix: file open mode

The available modes for open(...) are:

  • r open for reading (default)
  • w open for writing, truncating the file first
  • x open for exclusive creation, failing if the file already exists
  • a open for writing, appending to the end of the file if it exists
  • b binary mode
  • t text mode (default)
  • + open a disk file for updating (reading and writing)
  • U universal newlines mode (deprecated since 3.4)

References:

[1] Wikipedia: Comma-separated values

本文系Spark & Shine原创,转载需注明出处本文最近一次修改时间 2022-12-20 22:58

results matching ""

    No results matching ""